FSSpecs and FSRefs
Contains information and coding techniques useful in migrating your source from using FSSpecs to using FSRefs.
Differences between FSSpecs and FSRefs
The differences which will probably have the biggest impact on your code are that FSRefs cannot represent items which do not exist, and an FSRef is an opaque data structure defined as an array of 80 bytes, the content of which is not documented. In particular, an FSRef does not contain the name of the item to which it refers. This comes as no surprise when you consider that Mac OS X allows the use of file names containing Unicode characters, with a maximum length of 255 UniChars (see FSRefs and long Unicode File Names for more on this).
Converting FSSpecs to FSRefs and back
To convert an FSSpec to an FSRef:
To obtain an FSSpec from an FSRef:
How can I tell if an FSRef is valid?
How can I tell if two FSRefs represent the same object?
Getting the parent directory of an FSRef
How do I specify non-existent items, such as files you plan to create?
This technique is especially useful when storing data returned by
Don't pass FSRefs in AppleEvents.
Because FSRefs are not guaranteed to be valid across processes in Mac OS X, you shouldn't send them in AppleEvents. MoreFinderEvents contains code demonstrating how to pass aliases to the Finder through AppleEvents.
Like FSSpecs, FSRefs are not guaranteed to be valid across boots in Mac OS 9 or Mac OS X, across processes in Mac OS X, or even across separate launches of the same application in Mac OS X, so don't use them when you need persistent storage. For persistent storage, aliases are still the recommended approach. (Alias Manager)
Can I continue to use FSSpecs?
Yes, they continue to be valid file references. The name in an FSSpec can be mangled, though, so don't use FSSpecs to get file names for either storage or display. The name is mangled when the real name can't be stored in a Pascal string, or when it is longer than 31 characters. In the latter case you get names like "A really, really long file#23A4". The FSSpec still works; it just doesn't contain the item's real name.
Can I replace all of my references to FSSpecs with FSRefs?
It depends on your application.
How do I support FSRefs and long Unicode file names in open and save dialogs?
You have to use the new NavCreateXXX APIs introduced in Navigation Services 3.0. (Navigation Services) You also have to use these if you want to implement open and save dialogs as sheets (though they don't need to be sheets).
How do I get an FSRef to my application?
If your application is bundled, this will get an FSRef for your executable, not the bundle folder.
LaunchServices is a set of Mac OS X-only APIs for working with files.
Read through <LaunchServices.h> if you want to be up-to-date on
files in Mac OS X, where there are some new issues like bundled applications,
display names, new rules for application binding, and so on. TechNote
2017, "Using Launch Services for discovering document binding and
launching applications", also contains a wealth of information.
Use
The most straightforward approach to getting a file's path is with the
API:
Because the contents of an FSRef are undocumented, those contents may vary depending on the format of the volume containing the item to which the FSRef refers. Currently, an FSRef for an item on an HFS or HFS+ volume remains valid even if the item is moved or renamed, presumably because such an FSRef contains a file or directory ID for the item and a volume reference number. In this regard an FSRef is more robust than an FSSpec. Note that this is the current state of affairs, and as with any opaque data structure, any of this could change at any time and should not be relied upon. If you need robust file tracking, use aliases. (Alias Manager)
CarbonLib
If you're contemplating a CarbonLib project, be aware that FSRefs were introduced with the new HFS+ APIs in Mac OS 9 and hence require Mac OS 9 or later. CarbonLib provides a wrapper around FSRef APIs, but does not actually implement them, so you can't use CarbonLib to get FSRef functionality in any version of Mac OS 8.
FSRefs and long Unicode file names
How do I get the name of an item from an FSRef?
Since HFSUniStr255s occupy 512 bytes you may want to store names as CFStringRefs:
In addition to conserving memory, Core Foundation provides a wealth of APIs for testing and manipulating CFStrings. There are no such APIs for working with HFSUniStr255 file names. The assumption is that you will do such testing and manipulation with a CFStringRef or CFMutableStringRef obtained from an HFSUniStr255.
Note that FSGetCatalogInfo() returns the file system name. The Mac OS X Finder doesn't always display file name extensions. The name the Finder displays is called the display name. If you want the display name in Mac OS X, see the LaunchServices section.
Note: While technically correct, the definition of HFSUniStr255 is somewhat misleading. HFS+ disks store file names as UTF-16 in an Apple-modified form of Normalization Form D (decomposed). This means a single character can occupy more than one UniChar in an HFSUniStr255, which in turn means a file name may be limited to fewer than 255 characters as perceived by normal readers. (Read more about Unicode character terminology.)
Strictly speaking, the issues here are independent of the source of the CFString, but they are often encountered when dealing with Unicode file names.
Many of us need to display the name of a file or folder in our applications. Since Mac OS X supports long Unicode file names, there are some related issues. Unicode has a number of things going on under the hood which you wouldn't expect if you are unfamiliar with Unicode and how it works. The following are some basic points to remember when working with Unicode file names.
A Unicode string (speaking from the viewpoint of Mac OS X) is a string of UniChars. Such a string can be converted to and from a CFStringRef or a CFMutableStringRef.
A single Unicode code point may require multiple UniChars, so never modify a Unicode string by simply removing a range of UniChars or inserting UniChars at an arbitrary offset. Doing so can produce a string which is not what you expect or is incorrect, or even leave you with a string which is no longer a legal Unicode string.
Truncating Unicode strings by width
Truncating Unicode strings by length
Unfortunately, there is no simple API available which you can use to
correctly truncate a Unicode string to a certain number of characters.
To truncate a file name based on length, you'll need to convert the
name to a UniChar string and use
You can concatenate Unicode strings at will. The individual pieces
will retain their original meaning. For example, you can append ".txt"
to a Unicode string without changing the meaning of the existing string.
Or, you could concatenate English and Arabic (a right-to-left script)
and get the desired result.
Determining the width of strings
Don't try to estimate the width of a Unicode string based on the number
of UniChars in the string. In addition to the issues of combining characters
and surrogate pairs, Unicode text can contain invisible characters which
are not rendered. Unicode goes beyond the simple encoding of characters
and scripts. There are several code point values which can be used
to provide hints or instructions to rendering software, but are never
rendered themselves.
Encoding file names in other formats
Again, this is not an issue limited to file names, but is included
because people often make the mistake of assuming that the size of the
buffer needed when converting a CFString to a C string is
HFS+ disks store file names as UTF-16 in an Apple-modified form of Normalization Form D (decomposed). This form excludes certain compatibility decompositions and parts of the symbol blocks, in order to assure round-trip of file names to Mac OS encodings (applications using the HFS APIs assume they get the same bytes out that they put in).
In Mac OS X 10.2, the decomposition rules used were changed from Unicode 2.0.x (based on an intermediate draft) plus the above-mentioned Apple modifications, to Unicode 3.2 plus the above-mentioned Apple modifications. The Unicode Consortium has committed to not changing the decomposition rules after Unicode 3.2, so we shouldn't have to do this again. The change from 2.0.x to 3.2 was necessary because A) lots of new decompositions had been added, and B) the 2.0.x data was full of errors.
Other file systems use different storage formats. UFS disks use UTF-8, HFS disks use Mac OS encodings. AFP (AppleShare) uses Mac OS encodings prior to 3.0, and UTF-16 for 3.0 or later.
Notes About Using Unicode
This could also be called "Unicode for File Names," as there are many aspects of Unicode which won't be discussed here because they aren't needed if all you're doing with Unicode is working with file names in Mac OS X. The reason for focusing on this particular area is that it's an area which every Mac OS X application should be prepared to support. If you're writing a Unicode-savvy word processor, you're going to need a lot more understanding than any glossary notes can provide.
Most of the information presented here is from the book Unicode Demystified by Richard Gillam. However, it's 800 pages long and may be overkill if all you want to do is handle file names properly in Mac OS X.
Unicode is a universal text encoding standard for representing written language in a format suitable for use and storage by computers. Its goal is to allow the encoding of all, or at least all significant, forms of writing in use in the world today, as well as many which are no longer used but are of historical or scholarly interest.
There are two major challenges for those new to Unicode. First is getting a handle on the terminology. Second, and directly related to the first, is understanding what constitutes a character in a written language, in Unicode, and how the two are related (i.e., how characters are encoded in Unicode).
English is one of the simplest (if not the simplest) of all the world's languages to write and encode for use by computers. Understandably, people whose native language is English tend to make incorrect assumptions about how other languages are written and encoded into Unicode. When code is written based on those false assumptions, it will not work correctly for all languages.
Following are some terms often used in Unicode discussions.
A character is an abstract linguistic concept such as "the Latin letter A" or "the Chinese character for 'sun.'"
Every character defined in the Unicode standard is assigned a single 21-bit abstract code point value. Apple refers to a code point value in Unicode as a Unicode Scalar Value. MacTypes.h has the following to say:
Surrogate pairs - Unicode sets aside 2,048 code point values (U+D800 - U+DFFF) in the BMP which will never be assigned to actual characters. They are reserved for defining paired combinations to represent characters outside the BMP. These values are called surrogates. The first 1,024 surrogate values (U+D800-U+DBFF) are called high-surrogates, and the remaining 1,024 surrogate values (U+DC00-U+DFFF) are called low-surrogates. A supplementary-plane character (a character not in the BMP) is represented by a high-surrogate followed by a low-surrogate. Note that surrogates are only legal when they occur in high-low pairs. An unpaired surrogate is considered an error in Unicode.
In case you're just dying to know how a 21-bit code point value is mapped to a surrogate pair, it goes like this: First, subtract 0x10000 from the original code point value to get a 20-bit value. Split those 20 bits down the middle to get two 10-bit sequences. The first 10-bit sequence becomes the lower 10 bits of the high-surrogate value and the second 10-bit sequence becomes the lower 10 bits of the low-surrogate value.
Combining marks are code point values which do not represent characters themselves, but apply a mark to a base character which precedes them. Diacritical marks are one kind of combining mark. For example:
é = e + ´ (U+0065 LATIN SMALL LETTER E) + (U+0301 COMBINING ACUTE ACCENT)
A grapheme is a minimal writing unit in some written language; a mark that is considered a single "character" by an average reader or writer of a particular written language.
A grapheme cluster is a sequence of one or more Unicode code points (UniChars) that should be treated as an indivisible unit by most processes operating on Unicode text, such as searching and sorting, hit testing, arrow key movement, and so on. References to the term "cluster" in documentation, or in the headers, such as kUCTextBreakClusterMask, refer to grapheme clusters.
A glyph is a concrete visual representation of a character. It's what you see on screen or in print.
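The surrogate-pair mapping described in the glossary can be sketched in a few lines of plain C. This is only an illustration of the arithmetic, not code from any Apple header: the function names are hypothetical, and `uint16_t` is assumed here as a stand-in for the 16-bit UniChar type.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a supplementary-plane code point (U+10000..U+10FFFF) as a
   high-surrogate / low-surrogate pair. Hypothetical helper name. */
void encode_surrogate_pair(uint32_t code_point, uint16_t *high, uint16_t *low)
{
    assert(code_point >= 0x10000 && code_point <= 0x10FFFF);
    uint32_t v = code_point - 0x10000;          /* now a 20-bit value        */
    *high = (uint16_t)(0xD800 | (v >> 10));     /* first 10 bits  -> high    */
    *low  = (uint16_t)(0xDC00 | (v & 0x3FF));   /* second 10 bits -> low     */
}

/* Reverse the mapping: rebuild the code point from a valid pair. */
uint32_t decode_surrogate_pair(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low  >= 0xDC00 && low  <= 0xDFFF);
    uint32_t v = ((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00);
    return v + 0x10000;
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) encodes as the pair 0xD834, 0xDD1E, and decoding that pair gives back 0x1D11E.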
Truncating and other manipulations
The original intention was for Unicode to represent every character with a single UniChar, but it quickly became obvious that it isn't possible to do this. More than 95,000 characters are now defined in the Unicode standard, far more than can be represented by a single 16-bit value. Only code point values in the Basic Multilingual Plane can be represented with a single UniChar. Furthermore, a significant number of characters are represented as a base character plus one or more diacritical or other combining marks. Assuming that there's a one-to-one relationship between characters and the UniChars which represent them leads to one of the most common errors in code which manipulates Unicode strings, which is to truncate a Unicode string at an inappropriate offset. Always use appropriate Unicode-aware APIs to truncate a Unicode string or determine where to insert or remove characters. (See truncation comments.) (Encoded into Unicode as is done by the File Manager in Mac OS X, the string "résumé" contains eight UniChars. Lop off the last one and you'll have a Unicode string for "résume".)
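To make the "résumé" example concrete, here is a deliberately simplified sketch in plain C of how a truncation offset can be moved back so that it does not fall in the middle of a character. It is not a substitute for the Unicode-aware system APIs recommended above: it assumes the only cases that matter are surrogate pairs and marks from the Combining Diacritical Marks block (U+0300-U+036F), and `safe_truncation_offset` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }
/* Assumption: only U+0300..U+036F is treated as combining in this sketch. */
static int is_combining_mark(uint16_t u) { return u >= 0x0300 && u <= 0x036F; }

/* Return the largest offset <= max_units at which s (len 16-bit units)
   can be cut without separating a combining mark from its base character
   or splitting a surrogate pair. */
size_t safe_truncation_offset(const uint16_t *s, size_t len, size_t max_units)
{
    assert(s != NULL || len == 0);
    if (max_units >= len)
        return len;                 /* nothing needs to be removed */

    size_t cut = max_units;         /* s[cut] is the first unit removed */
    while (cut > 0 &&
           (is_combining_mark(s[cut]) ||
            (is_low_surrogate(s[cut]) && is_high_surrogate(s[cut - 1])))) {
        cut--;                      /* back up to the start of the character */
    }
    return cut;
}
```

With the decomposed form of "résumé" (r, e, U+0301, s, u, m, e, U+0301), asking for at most seven units returns an offset of six, so the result is "résum" rather than "résume".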
A 32-bit encoding would allow Unicode to provide a direct 1-1 correspondence between code point values and their encoded values, which in turn would eliminate most of those issues about where you can safely insert or truncate characters. 21 bits provides support for about a million characters, roughly 10 times the number currently encoded. But the smallest data type used by computers that's easily manipulated and will contain 21 bits is 32 bits. The downside is that an encoding scheme based on a 32-bit data type would waste a lot of space. If Unicode used a 32-bit encoding scheme (which would allow encoding every code point value in a single code value) it would waste at least 11 bits for every character, and at least 16 bits per character for the vast majority of characters in common use. For example, a 32-bit based HFSUniStr255 (used for file names in Mac OS X) would occupy 1022 bytes, even though most file names consist of fewer than 40-50 characters in the BMP.
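The size figures quoted here can be checked with a short sketch. `Name16` is a hypothetical stand-in that is assumed to mirror the layout of HFSUniStr255 (a 16-bit length followed by 255 16-bit code units); `name_bytes` is likewise a hypothetical helper that does the packed arithmetic for a given code unit size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in assumed to match HFSUniStr255: a 16-bit length plus 255 units. */
typedef struct {
    uint16_t length;
    uint16_t unicode[255];
} Name16;

/* Bytes needed for a 16-bit length plus 255 code units of unit_size bytes,
   ignoring any padding a compiler might add. */
size_t name_bytes(size_t unit_size)
{
    assert(unit_size == 2 || unit_size == 4);
    return sizeof(uint16_t) + 255 * unit_size;
}
```

Here `sizeof(Name16)` and `name_bytes(2)` both give 512, while `name_bytes(4)` gives 1022. A real C struct holding a 16-bit length followed by 32-bit units would normally be padded by the compiler to 1024 bytes, so 1022 is the unpadded total.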
References
Laurence Harris, SkyTag Software